January 28, 2016
You might be asking yourself; why use R for spatial analysis when there are commercial and open source Geographical Information Systems (GIS) like ESRI ArcMap and QGIS respectively. These are some of my reasons:
Some of my favorite packages for spatial data analysis include:
We shall use crime data from the Houston Police Department collected over the period of January 2010 - August 2010.
The crime data is in a comma separated value (CSV) format and small in size, only 13MB. It is available in this github respository: https://github.com/SparkIQ-Labs/Demos/tree/master/spatial-analysis/data.
Note: This full data is also available in the ggmap package as the data set "crime". You can load it into your R environment using the command data(crime).
So let's install some of my favorite packages.
## These are some of my favorite packages ## for spatial data analysis suppressPackageStartupMessages(library(ggmap)) suppressPackageStartupMessages(library(sp)) suppressPackageStartupMessages(library(rgdal)) suppressPackageStartupMessages(library(rgeos)) suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(leaflet)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(magrittr)) suppressPackageStartupMessages(library(readr)) suppressPackageStartupMessages(library(lubridate)) suppressPackageStartupMessages(library(maptools))
crime_df <- read_csv("data/crime.csv")
## Warning: 86314 problems parsing 'data/crime.csv'. See problems(...) for ## more details.
## We need to understand the structure of this dataset str(crime_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 86314 obs. of 17 variables: ## $ time : Date, format: NA NA ... ## $ date : chr "1/1/2010" "1/1/2010" "1/1/2010" "1/1/2010" ... ## $ hour : int 0 0 0 0 0 0 0 0 0 0 ... ## $ premise : chr "18A" "13R" "20R" "20R" ... ## $ offense : chr "murder" "robbery" "aggravated assault" "aggravated assault" ... ## $ beat : chr "15E30" "13D10" "16E20" "2A30" ... ## $ block : chr "9600-9699" "4700-4799" "5000-5099" "1000-1099" ... ## $ street : chr "marlive" "telephone" "wickview" "ashland" ... ## $ type : chr "ln" "rd" "ln" "st" ... ## $ suffix : chr "-" "-" "-" "-" ... ## $ number : int 1 1 1 1 1 1 1 1 1 1 ... ## $ month : chr "january" "january" "january" "january" ... ## $ day : chr "friday" "friday" "friday" "friday" ... ## $ location: chr "apartment parking lot" "road / street / sidewalk" "residence / house" "residence / house" ... ## $ address : chr "9650 marlive ln" "4750 telephone rd" "5050 wickview ln" "1050 ashland st" ... ## $ lon : num -95.4 -95.3 -95.5 -95.4 -95.4 ... ## $ lat : num 29.7 29.7 29.6 29.8 29.7 ... ## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 86314 obs. of 4 variables: ## ..$ row : int 1 2 3 4 5 6 7 8 9 10 ... ## ..$ col : int 1 1 1 1 1 1 1 1 1 1 ... ## ..$ expected: chr "date like %Y-%m-%d" "date like %Y-%m-%d" "date like %Y-%m-%d" "date like %Y-%m-%d" ... ## ..$ actual : chr "2010-01-01T06:00:00Z" "2010-01-01T06:00:00Z" "2010-01-01T06:00:00Z" "2010-01-01T06:00:00Z" ...
Let's take a quick look at some of data records and some summary statistics to get familiar with our data. This can be done using the head() and summary() commands.
head(crime_df, n = 5)
## Source: local data frame [5 x 17] ## ## time date hour premise offense beat block street ## 1 <NA> 1/1/2010 0 18A murder 15E30 9600-9699 marlive ## 2 <NA> 1/1/2010 0 13R robbery 13D10 4700-4799 telephone ## 3 <NA> 1/1/2010 0 20R aggravated assault 16E20 5000-5099 wickview ## 4 <NA> 1/1/2010 0 20R aggravated assault 2A30 1000-1099 ashland ## 5 <NA> 1/1/2010 0 20A aggravated assault 14D20 8300-8399 canyon ## Variables not shown: type (chr), suffix (chr), number (int), month (chr), ## day (chr), location (chr), address (chr), lon (dbl), lat (dbl)
summary(crime_df)
## time date hour premise ## Min. :NA Length:86314 Min. : 0.00 Length:86314 ## 1st Qu.:NA Class :character 1st Qu.: 8.00 Class :character ## Median :NA Mode :character Median :14.00 Mode :character ## Mean :NA Mean :13.28 ## 3rd Qu.:NA 3rd Qu.:19.00 ## Max. :NA Max. :23.00 ## NA's :86314 ## offense beat block ## Length:86314 Length:86314 Length:86314 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## street type suffix number ## Length:86314 Length:86314 Length:86314 Min. :1.000 ## Class :character Class :character Class :character 1st Qu.:1.000 ## Mode :character Mode :character Mode :character Median :1.000 ## Mean :1.012 ## 3rd Qu.:1.000 ## Max. :9.000 ## ## month day location ## Length:86314 Length:86314 Length:86314 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## address lon lat ## Length:86314 Min. :-99.51 Min. :27.51 ## Class :character 1st Qu.:-95.51 1st Qu.:29.69 ## Mode :character Median :-95.41 Median :29.74 ## Mean :-95.42 Mean :29.76 ## 3rd Qu.:-95.34 3rd Qu.:29.81 ## Max. :-91.95 Max. :37.34 ## NA's :5 NA's :5
The summary statistics reveal that we need to change the format of the variables "date", "offense", "month" and "day" to Date and Factors. We also need to get rid of the 5 data records with NAs in the "coordinates" variable as this will affect some of our spatial analyses.
## Because the sp pacakge is not able to find an inherited method ## for function ‘coordinates’ for signature ‘"tbl_df", ## let's convert our local dataframe. ## The variables "offense", "month", "day" should be factors crime_df <- data.frame(crime_df) %>% filter(lon != "NA") crime_df$offense <- as.factor(crime_df$offense) crime_df$month <- as.factor(crime_df$month) crime_df$day <- as.factor(crime_df$day) crime_df$date <- as.Date(crime_df$date)
In order to leverage the classes and methods in the several spatial packages, including the sp package, we need to convert the "crime_df" local dataframe into "SpatialPointsDataFrame".
## Convert to SpatialPointsDataFrame so as to use spatial packages
## The Cordinate Reference System is a Geographic CRS called WGS84
coords <- SpatialPoints(crime_df[, c("lon", "lat")])
crime_spatial_df <- SpatialPointsDataFrame(coords, crime_df)
proj4string(crime_spatial_df) <- CRS("+proj=longlat +ellps=WGS84")
getClass("Spatial")
## Class "Spatial" [package "sp"] ## ## Slots: ## ## Name: bbox proj4string ## Class: matrix CRS ## ## Known Subclasses: ## Class "SpatialPoints", directly ## Class "SpatialMultiPoints", directly ## Class "SpatialGrid", directly ## Class "SpatialLines", directly ## Class "SpatialPolygons", directly ## Class "SpatialPointsDataFrame", by class "SpatialPoints", distance 2 ## Class "SpatialPixels", by class "SpatialPoints", distance 2 ## Class "SpatialMultiPointsDataFrame", by class "SpatialMultiPoints", distance 2 ## Class "SpatialGridDataFrame", by class "SpatialGrid", distance 2 ## Class "SpatialLinesDataFrame", by class "SpatialLines", distance 2 ## Class "SpatialPixelsDataFrame", by class "SpatialPoints", distance 3 ## Class "SpatialPolygonsDataFrame", by class "SpatialPolygons", distance 2
Now let's take a look at the structure of this new data type.
str(crime_spatial_df)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots ## ..@ data :'data.frame': 86309 obs. of 17 variables: ## .. ..$ time : Date[1:86309], format: NA ... ## .. ..$ date : Date[1:86309], format: "1-01-20" ... ## .. ..$ hour : int [1:86309] 0 0 0 0 0 0 0 0 0 0 ... ## .. ..$ premise : chr [1:86309] "18A" "13R" "20R" "20R" ... ## .. ..$ offense : Factor w/ 7 levels "aggravated assault",..: 4 6 1 1 1 3 3 3 3 3 ... ## .. ..$ beat : chr [1:86309] "15E30" "13D10" "16E20" "2A30" ... ## .. ..$ block : chr [1:86309] "9600-9699" "4700-4799" "5000-5099" "1000-1099" ... ## .. ..$ street : chr [1:86309] "marlive" "telephone" "wickview" "ashland" ... ## .. ..$ type : chr [1:86309] "ln" "rd" "ln" "st" ... ## .. ..$ suffix : chr [1:86309] "-" "-" "-" "-" ... ## .. ..$ number : int [1:86309] 1 1 1 1 1 1 1 1 1 1 ... ## .. ..$ month : Factor w/ 8 levels "april","august",..: 4 4 4 4 4 4 4 4 4 4 ... ## .. ..$ day : Factor w/ 7 levels "friday","monday",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. ..$ location: chr [1:86309] "apartment parking lot" "road / street / sidewalk" "residence / house" "residence / house" ... ## .. ..$ address : chr [1:86309] "9650 marlive ln" "4750 telephone rd" "5050 wickview ln" "1050 ashland st" ... ## .. ..$ lon : num [1:86309] -95.4 -95.3 -95.5 -95.4 -95.4 ... ## .. ..$ lat : num [1:86309] 29.7 29.7 29.6 29.8 29.7 ... ## ..@ coords.nrs : num(0) ## ..@ coords : num [1:86309, 1:2] -95.4 -95.3 -95.5 -95.4 -95.4 ... ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : NULL ## .. .. ..$ : chr [1:2] "lon" "lat" ## ..@ bbox : num [1:2, 1:2] -99.5 27.5 -91.9 37.3 ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr [1:2] "lon" "lat" ## .. .. ..$ : chr [1:2] "min" "max" ## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. ..@ projargs: chr "+proj=longlat +ellps=WGS84"
You can see that the class is SpatialPointsDataFrame with 5 slots/components including:
We used the function CRS() to assign a Cordinate Reference System, in this case WGS84. We know that this is a GEOGRAPHIC Coordinate Reference System (CRS) type since the coordinates are in longitute and latitude format.
Alternatively you could use the coordinates() command as follows.
# Or using the "coordinates" method
crime_spatial_df1 <- crime_df
coordinates(crime_spatial_df1) <- c("lon", "lat")
proj4string(crime_spatial_df1) <- CRS("+proj=longlat +ellps=WGS84")
It's important to understand Coordinate Reference Systems (CRS). These are very helpful in geocoding data in space. There are two types of CRS namely:
The Projected Coordinate Reference System consists of several systems categorized into two main classes, namely:
If you see coordinate data that is in the following format, then it's in a Geographic Coordinate Systems.
Datums: Datums are sets of parameters and ground control points defining local coordinate systems. Examples: WGS84, NADS83
We now have a spatial points data frame with the right coordinate system and time data. We should probably save a copy of this as an R data file.
## So we can quickly read in our processed data ## without having to re-process it. saveRDS(crime_spatial_df, "data/crime_spatial_df.rds")
Alternatively we can save our processed data as ESRI shapefiles so as to maintain the spatial integretity, and also if you need to use with other GIS systems. The rgdal package provides writeOGR() command for writing out spatial data types.
## Alternatively create a shapefile of this data
writeOGR(crime_spatial_df, dsn = "data/shapefiles",
layer = "crime-shapefile", driver = "ESRI Shapefile",
overwrite_layer = TRUE)
Another data set that we shall use are administrative census tracts for the Housston area. These data are available as polygon shapefiles from the US Census website.
## Create a SpatialPolygonsDataFrame by reading in shapefile data
unzip("data/shapefiles/tl_2015_48_tract.zip",
exdir = "data/shapefiles",
overwrite = TRUE)
texas_shp <- readOGR(dsn = "data/shapefiles",
layer = "tl_2015_48_tract")
## OGR data source with driver: ESRI Shapefile ## Source: "data/shapefiles", layer: "tl_2015_48_tract" ## with 5265 features ## It has 12 fields
class(texas_shp)
## [1] "SpatialPolygonsDataFrame" ## attr(,"package") ## [1] "sp"
Alternatively we could use the readShapeSpatial() function from the maptools package to read shapefile data.
texas_shp2 <- readShapeSpatial("data/shapefiles/tl_2015_48_tract.shp",
proj4string = CRS("+proj=longlat +datum=WGS84"))
Our shapefile is of the class "SpatialPolygonsDataFrame". We can take quick look at by plotting in using the plot() function.
## Use plot method to plot it plot(texas_shp)
Some foundation methods in R can be used to explore spatial objects like plot(), summary(), print(), etc. For example summary() gives the number of spatial entities, the projection information, and the bounding box, and print() displays a view of the data in the spatial object.
The sp package provides richer methods for manipulating spatial objects.
A method for exploring the bounding area of any spatial object is the bbox() method. The first row reports the west–east range and the second the south–north direction.
bbox(crime_spatial_df)
## min max ## lon -99.50555 -91.94627 ## lat 27.50711 37.33690
We can explore the projection system of any spatial object using the proj4string() method. This method can also be used to assign a different coordinate system to a spatial object if need be. This can be done as follows:
proj4string(crime_spatial_df)
## [1] "+proj=longlat +ellps=WGS84"
We can explore/extract the individual slots in our spatial points data frame by using the "@" symbol instead of the "$" symbol. For example, let's look at the data and coordinate slots:
# Explore the SpatialPointsDataFrame head(crime_spatial_df@data)
## time date hour premise offense beat block street ## 1 <NA> 1-01-20 0 18A murder 15E30 9600-9699 marlive ## 2 <NA> 1-01-20 0 13R robbery 13D10 4700-4799 telephone ## 3 <NA> 1-01-20 0 20R aggravated assault 16E20 5000-5099 wickview ## 4 <NA> 1-01-20 0 20R aggravated assault 2A30 1000-1099 ashland ## 5 <NA> 1-01-20 0 20A aggravated assault 14D20 8300-8399 canyon ## 6 <NA> 1-01-20 0 20R burglary 18F60 9300-9399 rowan ## type suffix number month day location ## 1 ln - 1 january friday apartment parking lot ## 2 rd - 1 january friday road / street / sidewalk ## 3 ln - 1 january friday residence / house ## 4 st - 1 january friday residence / house ## 5 - 1 january friday apartment ## 6 ln - 1 january friday residence / house ## address lon lat ## 1 9650 marlive ln -95.43739 29.67790 ## 2 4750 telephone rd -95.29888 29.69171 ## 3 5050 wickview ln -95.45586 29.59922 ## 4 1050 ashland st -95.40334 29.79024 ## 5 8350 canyon -95.37791 29.67063 ## 6 9350 rowan ln -95.54830 29.70223
head(crime_spatial_df@coords, 4)
## lon lat ## [1,] -95.43739 29.67790 ## [2,] -95.29888 29.69171 ## [3,] -95.45586 29.59922 ## [4,] -95.40334 29.79024
# Restrict the data to downtown only
downtown_crime <- subset(crime_df,
-95.39681 <= lon & lon <= -95.34188 &
29.73631 <= lat & lat <= 29.78400)
R has several packages for visualizing spatial data. We shall look at tradition plotting systems in R that come with an R installation, commonly named "base-R" packages. An example is the plot() function for spatial data in the sp package. We shall also explore "external" packages for doing including ggplot2, ggmap, and leaflet.
By adding layers incrementally. Let's start by plotting the shapefiles of Houston Texas.
plot(texas_shp, col = "grey", axes = TRUE)
Then let's add crime data onto the map. Use the argument add = TRUE to add another layer onto the plot.
plot(texas_shp, col = "grey", axes = TRUE) plot(crime_spatial_df, pch = 21, bg = "red", cex = .5, add = TRUE)
Then add a title and a legend.
plot(texas_shp, col = "grey", axes = TRUE)
plot(crime_spatial_df, pch = 21, bg = "red", cex = .5, add = TRUE)
title("Locations of Offensive Crimes in Houston, Texas")
legend("topleft", title = "Legend", legend = "Crime Locations",
pch = 21, pt.bg = "red", bty = "n")
While the base functions for plotting spatial data are sufficient enough to produce maps, we would love to explore "external" R libraries that provide more capabilities and produce beautiful maps like ggplot2, ggmap, leaflet.
ggplot2 works with data frames and not objects of class Spatial*. So we have to convert them using the fortify() function from ggplot2.
crime_df <- data.frame(crime_spatial_df) texas_shp_df <- fortify(texas_shp)
## Regions defined for each Polygons
Now we can leverage ggplot2's powerful plotting capabilities.
p <- ggplot() +
geom_polygon(data = texas_shp_df, aes(x=long, y=lat, group = group)) + coord_equal() +
geom_point(data = crime_df, aes(x = lon, y = lat, color = "red")) +
labs(title = "Locations of Offensive Crimes in Houston, Texas") +
xlab("Longitude") +
ylab("Latitude")
p
Create the static background layer for the city of Houston, Texas
## Use "get_map" command to download the images map_dat <- get_map(location = "houston", source = "osm", zoom = 14)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=houston&zoom=14&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=houston&sensor=false
Use "ggmap" command to make the plot
houstonMap <- ggmap(map_dat, extent = "device", legend = "topleft") houstonMap
Plot the shapefiles
houstonMap2 <- houstonMap +
geom_polygon(aes(x = long, y = lat, group = group), data = texas_shp_df,
colour = "black",
alpha = .4, size = .3)
houstonMap2
Geocode the downtown crime data using longitute and latitute variables
houstonMap3 <- houstonMap2 +
geom_point(aes(x = lon, y = lat), data = downtown_crime, alpha = 0.5,
color="darkred", size = 3)
houstonMap3
## Warning: Removed 42 rows containing missing values (geom_point).
Leaflet for R is a package that provides functions to develop elegant and beautiful Leaflet maps. Leaflet is the leading open-source JavaScript library for interactive maps.
Let's first start by creating a map of the the houston area and then adding layers (features). This can be done using the leaflet() command. The first layer that we should add are the "Tiles" using the addTiles command.
m <- leaflet() %>% setView(lng = -95.3698028, lat = 29.7604267, zoom = 12) %>% addTiles() m
Leaflet Map
Then add the shapefile data onto the map. This is added as another layer, polygon, using the addPolygon command.
m1 <- m %>% addPolygons(data = texas_shp, fillColor = "transparent",
color = "black", weight = 1)
m1
Leaflet Polygons
Finally add the crime data onto the map. This is added as another layer, points, using the addMarkers command.
m2 <- m %>% addCircles(data = crime_spatial_df, weight = 1.5, color = "red") m2
Leaflet Points
In another meetup, we shall see how to perform geostatistical analysis on spatial data.
For further reading, refer to the following references.